ABSTRACT



Rigorous comparison of the reliability coefficients of several
tests or measurement procedures requires a sampling theory for the
coefficients. This paper summarizes the important aspects of the
sampling theory for Cronbach's (1951) coefficient alpha, a widely
used internal consistency coefficient. This theory enables researchers
to test a specific numerical hypothesis about the population alpha and
to obtain confidence intervals for the population coefficient. It also
permits researchers to test the hypothesis of equality among several
coefficients, either under the condition of independent samples or when
the same sample has been used for all measurements. The procedures are
illustrated numerically, and the assumptions and derivations underlying
the theory are discussed.
Introduction

When an estimate of the reliability of an educational or psychological
instrument is needed and the parallel forms and test-retest approaches
are impractical, investigators typically rely on internal consistency
coefficients. For cognitive tests and affective scales one of the most
commonly used indices is Cronbach's (1951) coefficient alpha. This
coefficient is also frequently employed in settings which involve raters
or observers (Ebel, 1951). The purpose of this review is to summarize
the sampling theory for coefficient alpha and to illustrate the uses
of this theory in evaluating reliability data.

The experimental problems for which the sampling theory is needed
include the following: (1) to test the hypothesis that coefficient alpha
equals a specified value in a given population; (2) to establish a
confidence interval for the alpha coefficient; (3) to test the hypothesis
of equality for two or more coefficients when the estimates are based on
independent samples; (4) to test the hypothesis of equality when the
observed coefficients are based on the same sample and hence are
dependent; and (5) to obtain an unbiased estimate of the population value
of alpha.

A test of a specific hypothesis is called for when a revised measurement
procedure is compared to an established, accepted procedure. In most
such instances this statistical test would involve a directional
alternative--that the new procedure is more reliable than the traditional
procedure. In some applications the test might be two-tailed, however. Such
an alternative might arise when changes in a measurement procedure make
administration more efficient, but might affect reliability either
positively or negatively.

Studies in which differences among coefficients are of concern to
investigators are not uncommon. Research on alternative methods of
measuring a specified trait may well call for a test of the equality of
alpha coefficients for the several methods. Evaluation of a training
program designed to enhance inter-rater reliability may also demand a test
of this null hypothesis. Refinement of an instrument may be assessed, in
part, by comparing the reliabilities of several alternative versions.
Oaster (1984), for example, encountered this situation in the refinement
of Likert scales.

These problems of inference require the development of a sampling
error theory for coefficient alpha. The first steps in this development
occurred in the early 1960s when Kristof (1963) and Feldt (1965)
independently derived a transformation of the sample alpha coefficient
which is proven to be distributed as F. They showed how this result can be
used to test hypotheses and generate confidence intervals for a single
alpha coefficient.

Techniques for testing the equality of alpha coefficients were
developed over the following twenty-year period. The first situation to be
considered was that of independent coefficients, that is, coefficients
obtained from separate examinee samples. Feldt (1969) derived an F test for
the two-coefficient case, and seven years later Hakstian and Whalen (1976)
extended the methodology to any number of coefficients. Dependent or
related coefficients--reliabilities based on the same sample--posed more
complex statistical problems. Feldt (1980) resolved these problems for
two coefficients; Woodruff and Feldt (in press) completed the cycle with
a test of equality of m dependent coefficients. In each instance, the
control of type I error was verified through computer-based Monte Carlo
studies.

The present paper synthesizes this statistical theory for Cronbach's
alpha. The principal objective is to make the procedures accessible to
researchers and to provide numerical illustrations. For each situation
the general outlines of the proofs and derivations are presented.

Inference for a Single Alpha Coefficient

Let $X_{jp}$ denote the score of person $p$ on item $j$. The test consists
of $n$ items or parts and is administered to $N$ subjects. Let $Y_p$ denote
the total test score for person $p$, i.e., $Y_p = \sum_{j=1}^{n} X_{jp}$.
The usual formula for the sample alpha coefficient, which will be denoted
as $\hat{\xi}$ herein, is

$$\hat{\xi} = \frac{n}{n-1}\left[1 - \frac{\sum_{j=1}^{n}\hat{\sigma}_j^2}{\hat{\sigma}_Y^2}\right] \qquad (1)$$

In this formula $\hat{\sigma}_j^2$ represents the unbiased estimate of the
variance for item $j$, and $\hat{\sigma}_Y^2$ that for score $Y$. The sample
alpha coefficient is denoted by $\hat{\xi}$ and its parameter value by
$\xi$ to avoid confusion with the symbolic representation of statistical
significance levels, almost universally denoted in the statistical
literature by $\alpha$.

Following Hoyt (1941), an alternate formula for $\hat{\xi}$ may be derived
by considering the responses of the $N$ subjects on the $n$ items as
observations in a two-way subjects-by-items ANOVA with one observation per
cell. Within this framework, a formula for $\hat{\xi}$ is

$$\hat{\xi} = \frac{MS(S) - MS(SI)}{MS(S)} = 1 - \frac{MS(SI)}{MS(S)} \qquad (2)$$

where MS(S) denotes the mean square for subjects and MS(SI) denotes the
mean square for subjects-by-items interaction. When applied to the setting
in which $n$ raters evaluate $N$ subjects, $\hat{\xi}$ can be used as a
measure of inter-rater agreement, with differences among rater means not
considered measurement error. In such a case, raters substitute for items
in equation (2).
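
As an illustrative sketch (not part of the original derivations), the
following Python fragment computes the sample coefficient both from
equation (1) and from the ANOVA form of equation (2) and confirms that the
two agree. The function names and the small score matrix are hypothetical;
rows are persons and columns are items.

```python
from statistics import mean, variance

def alpha_from_variances(scores):
    """Equation (1): scores[p][j] is the score of person p on item j."""
    n = len(scores[0])  # number of items
    # Unbiased (N - 1 denominator) variance of each item and of the total score.
    item_vars = [variance([row[j] for row in scores]) for j in range(n)]
    total_var = variance([sum(row) for row in scores])
    return (n / (n - 1)) * (1 - sum(item_vars) / total_var)

def alpha_from_anova(scores):
    """Equation (2): 1 - MS(SI)/MS(S) from the subjects-by-items ANOVA."""
    N, n = len(scores), len(scores[0])
    grand = mean([x for row in scores for x in row])
    subj_means = [mean(row) for row in scores]
    item_means = [mean([row[j] for row in scores]) for j in range(n)]
    ss_subjects = n * sum((m - grand) ** 2 for m in subj_means)
    ss_interaction = sum(
        (scores[p][j] - subj_means[p] - item_means[j] + grand) ** 2
        for p in range(N) for j in range(n))
    ms_s = ss_subjects / (N - 1)
    ms_si = ss_interaction / ((N - 1) * (n - 1))
    return 1 - ms_si / ms_s

# Hypothetical data: 5 persons by 3 items.
scores = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 5, 4], [5, 4, 5]]
alpha_eq1 = alpha_from_variances(scores)
alpha_eq2 = alpha_from_anova(scores)
print(alpha_eq1, alpha_eq2)
```

The agreement is algebraic: the total-score variance equals $n \cdot MS(S)$
and the sum of the item variances equals $MS(S) + (n-1)MS(SI)$.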

Let E denote expected value and in particular let the expected values
for MS(S) and MS(SI) be denoted as E[MS(S)] and E[MS(SI)], respectively.
The population value of coefficient alpha is defined as

$$\xi = \frac{E[MS(S)] - E[MS(SI)]}{E[MS(S)]} = 1 - \frac{E[MS(SI)]}{E[MS(S)]} . \qquad (3)$$

Kristof (1963) and Feldt (1965) independently proved that when the usual
assumptions for the two-way random effects (type II) ANOVA are met, the
following statistic is distributed as F with $N-1$ and $(n-1)(N-1)$ degrees
of freedom:

$$\frac{1-\xi}{1-\hat{\xi}} = \frac{MS(S)/E[MS(S)]}{MS(SI)/E[MS(SI)]} . \qquad (4)$$

It may also be shown that if (1) items are treated as a fixed factor in
the two-way ANOVA, (2) the usual assumptions for the two-way mixed model
(type III) ANOVA are met, and (3) there is no items-by-subjects interaction,
the same F distribution holds for the statistic given in equation (4)
(Scheffe, 1959, cha-
The proof that $(1-\xi)/(1-\hat{\xi})$ is an F variable follows from the
fact that under the assumed ANOVA model MS(S)/E[MS(S)] is distributed as a
chi square variable divided by its degrees of freedom, $N-1$. Likewise,
MS(SI)/E[MS(SI)] is distributed as a chi square variable divided by its
degrees of freedom, $(n-1)(N-1)$. Under the assumed model these chi squares
are independent. Therefore, their ratio (equation 4) is distributed as a
central F with $N-1$ and $(n-1)(N-1)$ degrees of freedom.

This distribution theory for $(1-\xi)/(1-\hat{\xi})$ may be used to
formulate a test of a specific numerical hypothesis and derive a confidence
interval for a population alpha coefficient. To test the null hypothesis
$H_0: \xi = \xi_0$ against a two-tailed alternative at the $\alpha$ level of
significance, let $F(\alpha/2)$ denote the $100(\alpha/2)$ percentile and
$F(1-\alpha/2)$ the $100(1-\alpha/2)$ percentile of the central F with
$N-1$ and $(n-1)(N-1)$ as its df. The null hypothesis is rejected if

$$\hat{\xi} < 1 - \frac{1-\xi_0}{F(\alpha/2)} \quad \text{or} \quad \hat{\xi} > 1 - \frac{1-\xi_0}{F(1-\alpha/2)} . \qquad (5)$$

If a one-tailed test at the $\alpha$ level of significance is desired,
$\alpha/2$ is replaced by $\alpha$ in the appropriate critical value.

The upper and lower endpoints of a $100(1-\alpha)$ percent confidence
interval for $\xi$ are given respectively by

$$\xi_U = 1 - (1-\hat{\xi})F(\alpha/2) \quad \text{and} \quad \xi_L = 1 - (1-\hat{\xi})F(1-\alpha/2) . \qquad (6)$$

If a one-sided $100(1-\alpha)$ percent interval is desired, $\alpha/2$ is
replaced by $\alpha$ in the appropriate endpoint.

The foregoing results may be illustrated by the following example.
Suppose a researcher used 41 examinees to obtain an estimate of .790 for
the alpha coefficient of a 26-item test. The relevant F distribution
has df = 40 and 1000, for which the fifth and ninety-fifth percentiles
are 0.66 and 1.41. The 90% confidence interval (bounded below and above)
has

$$\xi_L = 1 - (1-.79)(1.41) = .704$$

$$\xi_U = 1 - (1-.79)(0.66) = .861 .$$

A one-tailed test of $H_0: \xi = .70$, with $H_{alt}: \xi > .70$ and
$\alpha = .05$, would require only a lower bound for the critical region.
By equation (5), the critical region (C.R.) is

$$\text{C.R.}: \hat{\xi} > 1 - \frac{1-.70}{F(.95)} = 1 - \frac{1-.70}{1.41} = .787 .$$

Since the observed coefficient alpha of .790 exceeds the lower bound of
the critical region, $H_0: \xi = .70$ may be rejected.
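
The arithmetic of this example can be retraced with a short Python sketch.
The function names are hypothetical, and the F percentiles (0.66 and 1.41)
are simply the tabled values quoted in the text rather than values computed
from an F distribution routine.

```python
def alpha_confidence_interval(alpha_hat, f_low, f_high):
    """Equation (6): f_low = F(a/2) and f_high = F(1 - a/2) for the central F
    with N - 1 and (n - 1)(N - 1) degrees of freedom."""
    lower = 1 - (1 - alpha_hat) * f_high
    upper = 1 - (1 - alpha_hat) * f_low
    return lower, upper

def upper_critical_value(alpha_null, f_high):
    """Equation (5), upper tail: reject H0 when the sample alpha exceeds this."""
    return 1 - (1 - alpha_null) / f_high

# Values from the worked example: N = 41, n = 26, sample alpha = .790.
lower, upper = alpha_confidence_interval(0.790, 0.66, 1.41)
critical = upper_critical_value(0.70, 1.41)
reject_null = 0.790 > critical
print(round(lower, 3), round(upper, 3), round(critical, 3), reject_null)
```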

The expected value of $\hat{\xi}$, $E[\hat{\xi}]$, and the bias in
$\hat{\xi}$ can be deduced from the fact that $(1-\hat{\xi})/(1-\xi)$ is
also distributed as F, but with $(n-1)(N-1)$ and $(N-1)$ degrees of
freedom. Since the expected value of a central F is $\nu_2/(\nu_2-2)$,
where $\nu_2$ is the second df value,

$$E[(1-\hat{\xi})/(1-\xi)] = (N-1)/(N-3),$$

and hence

$$E[\hat{\xi}] = 1 - (1-\xi)(N-1)/(N-3) .$$

It follows that

$$E[\hat{\xi}] - \xi = 2(\xi-1)/(N-3) .$$

Since $\xi < 1$ the difference must be negative, and hence $\hat{\xi}$ tends
to underestimate $\xi$. This result was first presented by Kristof (1963).
The negative bias of $\hat{\xi}$ is generally of little consequence unless
$N$ is small. If $N = 50$ and $\xi = .70$, for example, the expected value
of $\hat{\xi}$ is .687. With $N = 100$, the expected value is .694. Where an
unbiased estimate of $\xi$ is required, it may be obtained by the formula

$$\tilde{\xi} = [(N-3)\hat{\xi}/(N-1)] + 2/(N-1) .$$
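
A minimal Python sketch of these two formulas follows; the function names
are hypothetical. It reproduces the expected values quoted above and checks
that the correction exactly undoes the bias in expectation.

```python
def expected_sample_alpha(xi, N):
    """E[sample alpha] = 1 - (1 - xi)(N - 1)/(N - 3) for population value xi."""
    return 1 - (1 - xi) * (N - 1) / (N - 3)

def unbiased_alpha(alpha_hat, N):
    """Bias-corrected estimate: (N - 3) * alpha_hat / (N - 1) + 2 / (N - 1)."""
    return (N - 3) * alpha_hat / (N - 1) + 2 / (N - 1)

e50 = expected_sample_alpha(0.70, 50)    # about .687, as stated in the text
e100 = expected_sample_alpha(0.70, 100)  # about .694, as stated in the text
# Applying the correction to the expected value recovers the parameter.
recovered = unbiased_alpha(e50, 50)
print(round(e50, 3), round(e100, 3), recovered)
```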

Comparison of Alpha Coefficients Obtained 
from Independent Samples 

Rigorous comparisons of alternative test scoring procedures, test
construction techniques, item formats, item selection strategies, modes of
test administration, or competing test instruments entail, in part, the
comparison of reliabilities. The first paper to address this problem was
published by Feldt (1969), who derived a statistical test of the hypothesis
$H_0: \xi_1 = \xi_2$. The Feldt approach is based on the test statistic
$W = (1-\hat{\xi}_2)/(1-\hat{\xi}_1)$. He proved that when the reliability
parameters are equal, $W$ is distributed as the product of two independent
central F variables. This product, it was shown, could be well approximated
by a single F with $N_1-1$ and $N_2-1$ degrees of freedom. With modern
computing equipment it is relatively simple to determine the probability
that a central F will exceed the obtained value of $W$. If the probability
is less than the significance level, the hypothesis of equality can be
rejected.

Hakstian and Whalen (1976) extended the methodology to the case of m
coefficients. Their test is based on the normalizing transformation of F
developed by Paulson (1942) and the fact that $(1-\hat{\xi})/(1-\xi)$ is
distributed as F with $\nu_1 = (N-1)(n-1)$ and $\nu_2 = (N-1)$ degrees of
freedom. Paulson proved that

$$z = \frac{\left(1-\frac{2}{9\nu_2}\right)F^{1/3} - \left(1-\frac{2}{9\nu_1}\right)}{\sqrt{\frac{2}{9\nu_2}F^{2/3} + \frac{2}{9\nu_1}}}$$

where $z$ is distributed approximately as a unit normal deviate. In the
present context, the transformation may be stated as follows:

$$z = \frac{(1-\hat{\xi})^{1/3} - \left[\left(1-\frac{2}{9\nu_1}\right)(1-\xi)^{1/3} \Big/ \left(1-\frac{2}{9\nu_2}\right)\right]}{\sqrt{\frac{18\nu_2(1-\hat{\xi})^{2/3}}{(9\nu_2-2)^2} + \frac{18\nu_2^2(1-\xi)^{2/3}}{\nu_1(9\nu_2-2)^2}}} \qquad (7)$$
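This ratio implies that $(1-\hat{\xi})^{1/3}$ is approximately normally
distributed with non-zero mean (the term in brackets in equation 7) and
variance approximated by

$$S^2 = \frac{18\nu_2(1-\hat{\xi})^{2/3}}{(9\nu_2-2)^2}\left[1 + \frac{\nu_2}{\nu_1}\right] = \frac{18(N-1)(1-\hat{\xi})^{2/3}}{(9N-11)^2} \cdot \frac{n}{n-1} .$$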

Hakstian and Whalen (1976) propose that the weighted average ($\mu^*$) of
the $(1-\hat{\xi}_i)^{1/3}$ be obtained, the weights equalling the
reciprocals of the variances. The test statistic is then defined as

$$M = \sum_{i=1}^{m} \frac{\left[(1-\hat{\xi}_i)^{1/3} - \mu^*\right]^2}{S_i^2} \qquad (8)$$

which is interpreted as a chi square with $m-1$ degrees of freedom. The
justification for this interpretation is that the sum of m squared
standardized deviations of normal variables from their weighted mean is so
distributed. The test is thus analogous to the test of the equality of m
correlation coefficients, wherein Fisher's transformation of the
coefficients has achieved normality with variances $1/(N_i-3)$. (See Hays,
1981, p. 469.)
There are two minor problems with the Hakstian/Whalen test. First, the
variance of $(1-\hat{\xi}_i)^{1/3}$ is an estimate based on the sample value
of $\hat{\xi}_i$. This is contrary to theory and in contrast to the case of
transformed correlations, in which the variances, $1/(N_i-3)$, do not depend
upon sample estimates. Second, even if all $\xi_i$ are equal, the statistics
$(1-\hat{\xi}_i)^{1/3}$ do not come from the same normal distribution unless
the bracketed term in equation (7) is the same for all tests. This equality
demands that

$$\left(1 - \frac{2}{9\nu_1}\right) \Big/ \left(1 - \frac{2}{9\nu_2}\right)$$

be constant over all tests.

Fortunately, these problems appear to be of little consequence.
The use of the sample statistic, $\hat{\xi}_i$, to replace the parameter,
$\xi_i$, in the second term under the radical in the denominator of
equation (7) seems to have little effect on the distribution of the ratio.
The net effect might be likened to that of interpreting a t-statistic as a
normally distributed variable--an interpretation that involves no serious
error when the sample size is larger than 50. (See Marascuilo, 1966.) The
minimal effect of replacing $\xi$ by $\hat{\xi}$ in the second term probably
results from the fact that this term is of order $2/[9(N-1)(n-1)]$. The
first term, which properly includes $\hat{\xi}$, is of order $2/[9(N-1)]$.

The second problem also proves to be of negligible importance by
virtue of the fact that $1-(2/9\nu_1)$ and $1-(2/9\nu_2)$ are both very
close to one, regardless of the variation in $n_i$ and $N_i$ from test to
test. For example, if $n_1 = 50$ and $N_1 = 100$ the ratio of these terms is
1.0022. If $n_2 = 10$ and $N_2 = 50$, the ratio is 1.0040. Thus, the
hypothesis that $1.0022(1-\xi_1)^{1/3}$ equals $1.0040(1-\xi_2)^{1/3}$ is
essentially a hypothesis that $\xi_1$ equals $\xi_2$.
Woodruff and Feldt (in press) followed a different line of reasoning
to arrive at a similar test of the null hypothesis for m coefficients.
They adopt the transformation $1/(1-\hat{\xi})^{1/3}$ rather than
$(1-\hat{\xi})^{1/3}$. A critical point in the subsequent derivation is the
identification of a chi-square distribution (df to be determined) for which
the variable $\chi^2/df$ has nearly the same mean, variance, skewness, and
kurtosis as $(1-\xi)/(1-\hat{\xi})$. The latter variable is distributed as F
with $N-1$ and $(N-1)(n-1)$ degrees of freedom. The chi-square distribution
which best satisfies this requirement takes $df = \tilde{N}_i - 1$, where
$\tilde{N}_i = (n_i-1)N_i/(n_i+1)$. Woodruff and Feldt approximate the
variance of $1/(1-\hat{\xi}_i)^{1/3}$ by the Wilson/Hilferty (1931)
normalizing transformation of a chi-square variable. This leads to the
following estimate of the variance of $1/(1-\hat{\xi}_i)^{1/3}$:

$$S_i^2 = \frac{2}{9(\tilde{N}_i-1)(1-\hat{\xi}_i)^{2/3}} .$$

Unlike Hakstian and Whalen (1976), Woodruff and Feldt use the arithmetic
mean of the transformed coefficients:

$$\bar{\mu} = \frac{1}{m}\sum_{i=1}^{m} (1-\hat{\xi}_i)^{-1/3} .$$

Their test statistic, under the assumption of independent samples, is

$$UX_2 = \sum_{i=1}^{m} \left[(1-\hat{\xi}_i)^{-1/3} - \bar{\mu}\right]^2 \Big/ \bar{S}^2 ,$$

where $\bar{S}^2$ is the arithmetic mean of the several variances, $S_i^2$.
Then, under $H_0$, $UX_2$ is approximately distributed as $\chi^2$ with
$m-1$ degrees of freedom.
These two approaches may be illustrated by the following data:

Group 1 and Test 1:  $\hat{\xi}_1 = .784$;  $n_1 = 5$, $N_1 = 51$;   $(1-\hat{\xi}_1)^{1/3} = .6$
Group 2 and Test 2:  $\hat{\xi}_2 = .875$;  $n_2 = 5$, $N_2 = 101$;  $(1-\hat{\xi}_2)^{1/3} = .5$
Group 3 and Test 3:  $\hat{\xi}_3 = .936$;  $n_3 = 5$, $N_3 = 151$;  $(1-\hat{\xi}_3)^{1/3} = .4$

The Hakstian and Whalen variances equal .0020179, .00069754, and .00029718.
The weighted average, $\mu^*$, equals .4458. The test statistic, interpreted
as a $\chi^2$ with df = 2, equals 23.053.

The same data, analyzed via the Woodruff and Feldt approach, yield
variances of .0187056, .0134003, and .0139353, and $\bar{\mu} = 2.05556$.
The test statistic, also interpreted as a chi-square with df = 2, equals
22.926. One might expect on the basis of the underlying derivations that
the Woodruff/Feldt test would give results that are quite consistent with
those of the Hakstian/Whalen test, as they were in this instance.
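
Both computations for the three-group data can be retraced with the
following Python sketch. The function and variable names are hypothetical;
the formulas are the Hakstian/Whalen variance and statistic M and the
Woodruff/Feldt variance and independent-samples statistic as stated above.

```python
def hakstian_whalen(alphas, n_items, n_subjects):
    """Weighted-mean chi-square test on the (1 - alpha)^(1/3) scale."""
    x = [(1 - a) ** (1 / 3) for a in alphas]
    var = [18 * (N - 1) * (1 - a) ** (2 / 3) / (9 * N - 11) ** 2 * n / (n - 1)
           for a, n, N in zip(alphas, n_items, n_subjects)]
    weights = [1 / v for v in var]
    mu_star = sum(w * xi for w, xi in zip(weights, x)) / sum(weights)
    stat = sum((xi - mu_star) ** 2 / v for xi, v in zip(x, var))
    return var, mu_star, stat

def woodruff_feldt_independent(alphas, n_items, n_subjects):
    """Unweighted-mean chi-square test on the (1 - alpha)^(-1/3) scale."""
    u = [(1 - a) ** (-1 / 3) for a in alphas]
    var = []
    for a, n, N in zip(alphas, n_items, n_subjects):
        n_tilde = N * (n - 1) / (n + 1)  # adjusted sample size
        var.append(2 / (9 * (n_tilde - 1) * (1 - a) ** (2 / 3)))
    u_bar = sum(u) / len(u)
    s2_bar = sum(var) / len(var)
    stat = sum((ui - u_bar) ** 2 for ui in u) / s2_bar
    return var, u_bar, stat

# Data from the illustration in the text.
alphas = [0.784, 0.875, 0.936]
n_items = [5, 5, 5]
n_subjects = [51, 101, 151]

hw_var, mu_star, M = hakstian_whalen(alphas, n_items, n_subjects)
wf_var, u_bar, UX2 = woodruff_feldt_independent(alphas, n_items, n_subjects)
print(hw_var, mu_star, M)
print(wf_var, u_bar, UX2)
```

Each statistic is referred to a chi-square distribution with m - 1 = 2
degrees of freedom.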

If tests of pairwise contrasts among the coefficients are warranted
on the basis of a significant outcome of the omnibus test, the pairs can
be considered by Feldt's (1969) test for two coefficients. In the present
instance all pairs lead to rejection of the null hypothesis.
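
The pairwise follow-up can be sketched in Python as below. This is only an
approximation under stated assumptions: the tail probability of F is
obtained from Paulson's (1942) normal approximation described earlier in
this section rather than from an exact F routine, the test is taken to be
two-tailed at the .05 level (the text does not state the level used), and
the function names are hypothetical.

```python
import math
from itertools import combinations

def paulson_upper_tail(f_value, df1, df2):
    """Approximate P(F > f_value) using Paulson's normalizing transformation."""
    a = 1 - 2 / (9 * df2)
    b = 1 - 2 / (9 * df1)
    z = (a * f_value ** (1 / 3) - b) / math.sqrt(
        (2 / (9 * df2)) * f_value ** (2 / 3) + 2 / (9 * df1))
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the unit normal

def feldt_w_test(alpha1, N1, alpha2, N2):
    """W = (1 - alpha2)/(1 - alpha1), referred to F with N1 - 1 and N2 - 1 df."""
    w = (1 - alpha2) / (1 - alpha1)
    upper = paulson_upper_tail(w, N1 - 1, N2 - 1)
    p_two_tailed = 2 * min(upper, 1 - upper)
    return w, p_two_tailed

# Three-group data from the illustration: (sample alpha, N).
groups = [(0.784, 51), (0.875, 101), (0.936, 151)]
results = {}
for i, j in combinations(range(len(groups)), 2):
    (a_i, N_i), (a_j, N_j) = groups[i], groups[j]
    results[(i + 1, j + 1)] = feldt_w_test(a_i, N_i, a_j, N_j)

for pair, (w, p) in results.items():
    print(pair, round(w, 4), p)
```

In practice an exact F distribution function from a statistical library
would replace the normal approximation.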

Comparison of Alpha Coefficients Obtained
from the Same Sample

In some settings it is possible to administer all instruments or
apply all procedures to the same sample of N examinees. In such instances
the coefficients are statistically dependent, and the test of the null
hypothesis must recognize this dependence. To ignore it is tantamount in
most applications to adoption of a significance level far more stringent
than the nominal level.






The methodology for the case of dependent statistics, like that for
independent statistics, was first developed for $H_0: \xi_1 = \xi_2$. Feldt
(1980) derived three procedures for testing this hypothesis. Simulation
studies indicated that all three procedures control type I error rates
satisfactorily. Feldt recommended the following test statistic:

$$t = \frac{(\hat{\xi}_1 - \hat{\xi}_2)\sqrt{N-2}}{\sqrt{4(1-\hat{\xi}_1)(1-\hat{\xi}_2)(1-\hat{\rho}_{12}^2)}} , \qquad df = N-2 \qquad (10)$$

The squared correlation in the denominator refers to the squared
correlation coefficient between the two total-test scores for the sample.

The derivation of this test rests on the fact that if $\xi_1 = \xi_2$,
$(1-\hat{\xi}_2)/(1-\hat{\xi}_1)$ is distributed identically as the ratio of
two dependent sample variances, each with expected value of 1.0. Pitman
(1939) proved that the following function of such a ratio is distributed as
t with df = N-2:

$$t = \frac{\left[(\hat{\sigma}_2^2/\hat{\sigma}_1^2) - 1\right]\sqrt{N-2}}{\sqrt{4(\hat{\sigma}_2^2/\hat{\sigma}_1^2)(1-\hat{\rho}^2)}}$$

Thus, the same function of $(1-\hat{\xi}_2)/(1-\hat{\xi}_1)$ must be
distributed as t with df of N-2. Substitution of
$(1-\hat{\xi}_2)/(1-\hat{\xi}_1)$ for $\hat{\sigma}_2^2/\hat{\sigma}_1^2$ in
this expression ultimately leads, after algebraic simplification, to
equation 10.
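
The algebraic equivalence of the two forms can be checked numerically with
the Python sketch below. The function names are hypothetical, and the
inputs are borrowed from Tests 1 and 4 of the four-test illustration given
later in this section (coefficients .875 and .800, inter-test correlation
.65, N = 100); they are used here only to exercise the formulas.

```python
import math

def feldt_dependent_t(alpha1, alpha2, r12, N):
    """Equation (10): t statistic with N - 2 df for two dependent alphas."""
    numerator = (alpha1 - alpha2) * math.sqrt(N - 2)
    denominator = math.sqrt(4 * (1 - alpha1) * (1 - alpha2) * (1 - r12 ** 2))
    return numerator / denominator, N - 2

def pitman_t(variance_ratio, r12, N):
    """Pitman's (1939) t for the ratio of two dependent sample variances."""
    return ((variance_ratio - 1) * math.sqrt(N - 2)
            / math.sqrt(4 * variance_ratio * (1 - r12 ** 2)))

alpha1, alpha2, r12, N = 0.875, 0.800, 0.65, 100
t_eq10, df = feldt_dependent_t(alpha1, alpha2, r12, N)
# Substituting (1 - alpha2)/(1 - alpha1) for the variance ratio gives the same t.
t_pitman = pitman_t((1 - alpha2) / (1 - alpha1), r12, N)
print(t_eq10, t_pitman, df)
```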

Woodruff and Feldt (in press) extended the methodology to the case of
m dependent coefficients. They considered eleven possible test statistics.
Extensive Monte Carlo simulation led to three procedures that showed
excellent control of type I error and superior power, compared to the
others. Of these three techniques, the procedure identified as $UX_1$ was
the simplest computationally and is summarized here.
As in the case of independent coefficients, Woodruff and Feldt (in
press) approximate the variance of $1/(1-\hat{\xi}_i)^{1/3}$ by the quantity

$$S_i^2 = \frac{2}{9(\tilde{N}-1)(1-\hat{\xi}_i)^{2/3}} . \qquad (11)$$

However, the test for dependent alphas also demands an approximation
of the covariance between $1/(1-\hat{\xi}_i)^{1/3}$ and
$1/(1-\hat{\xi}_j)^{1/3}$. Using the delta method of Stuart and Kendall
(1977), Woodruff and Feldt derived the following estimate:

$$S_{ij} = \frac{2\hat{\rho}_{ij}^2}{9(\tilde{N}-1)(1-\hat{\xi}_i)^{1/3}(1-\hat{\xi}_j)^{1/3}} . \qquad (12)$$

As in the case of two coefficients, $\hat{\rho}_{ij}^2$ is the square of the
sample correlation between the scores on tests $i$ and $j$. When the tests
differ in length, then $\tilde{N} = N(\tilde{n}-1)/(\tilde{n}+1)$, where
$\tilde{n}$ is the harmonic mean of all test lengths. On the assumption that
the variables $1/(1-\hat{\xi}_i)^{1/3}$ have a multivariate normal
distribution, a matrix function of the coefficients, the inter-test
correlations, the variance and covariance estimates, and $N$ is shown to be
distributed approximately as $\chi^2$ with $m-1$ degrees of freedom.
Woodruff and Feldt (in press) further show that an approximation of this
function serves satisfactorily as a test statistic. It is

$$UX_1 = \sum_{i=1}^{m} \left[(1-\hat{\xi}_i)^{-1/3} - \bar{\mu}\right]^2 \Big/ (\bar{S}^2 - \bar{C}) , \qquad (13)$$

where $\bar{S}^2$ is the average of the variances $S_i^2$ (equation 11),
$\bar{C}$ is the average of the covariances $S_{ij}$ (equation 12), and
$\bar{\mu}$ is the average of the transformed coefficients,
$1/(1-\hat{\xi}_i)^{1/3}$. $UX_1$ is distributed approximately as chi-square
with $m-1$ degrees of freedom when $H_0$ is true.

This procedure may be illustrated by a four-test situation with
$N = 100$ and the following data:

Test 1:  $n_1 = 100$   $\hat{\xi}_1 = .875$   $(1-\hat{\xi}_1)^{-1/3} = 2.0000$
Test 2:  $n_2 = 75$    $\hat{\xi}_2 = .857$   $(1-\hat{\xi}_2)^{-1/3} = 1.9123$
Test 3:  $n_3 = 50$    $\hat{\xi}_3 = .833$   $(1-\hat{\xi}_3)^{-1/3} = 1.8159$
Test 4:  $n_4 = 25$    $\hat{\xi}_4 = .800$   $(1-\hat{\xi}_4)^{-1/3} = 1.7100$

         $\tilde{n} = 48.0000$                $\bar{\mu} = 1.85955$

Correlations:

          Test 1    Test 2    Test 3    Test 4
Test 1     1.00      .80       .75       .65
Test 2               1.00      .70       .60
Test 3                         1.00      .55
Test 4                                   1.00

$S_i^2$ and $S_{ij}$:

          Test 1      Test 2      Test 3      Test 4
Test 1    .0093648
Test 2    .0057306    .0085615
Test 3    .0047828    .0039836    .0077201
Test 4    .0033829    .0027561    .0021991    .0068459

$\bar{S}^2 = .0081231$     $\bar{C} = .0038059$     $\bar{S}^2 - \bar{C} = .0043172$

$UX_1 = 10.836$     $P[\chi^2_{(3)} > 10.836] = .013$

For this example, the more complex function for which $UX_1$ substitutes
has the numerical value 11.118.
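
The four-test illustration can be retraced with the Python sketch below,
which applies equations (11) through (13) to the data above. The function
names are hypothetical, and the chi-square tail probability uses a
closed-form expression that is valid only for 3 degrees of freedom (the
case m = 4), not a general chi-square routine.

```python
import math

def woodruff_feldt_dependent(alphas, n_items, N, corr):
    """UX1 of equation (13) for m coefficients from one sample of size N."""
    m = len(alphas)
    n_harmonic = m / sum(1 / n for n in n_items)       # harmonic mean length
    n_tilde = N * (n_harmonic - 1) / (n_harmonic + 1)  # adjusted sample size
    c = 9 * (n_tilde - 1)
    root = [(1 - a) ** (1 / 3) for a in alphas]
    u = [1 / r for r in root]                          # transformed coefficients
    variances = [2 / (c * r ** 2) for r in root]       # equation (11)
    covariances = [2 * corr[i][j] ** 2 / (c * root[i] * root[j])
                   for i in range(m) for j in range(i + 1, m)]  # equation (12)
    s2_bar = sum(variances) / m
    c_bar = sum(covariances) / len(covariances)
    u_bar = sum(u) / m
    stat = sum((ui - u_bar) ** 2 for ui in u) / (s2_bar - c_bar)
    return stat, s2_bar, c_bar, n_harmonic

def chi2_upper_tail_df3(x):
    """P(chi-square with 3 df > x); closed form specific to df = 3."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

alphas = [0.875, 0.857, 0.833, 0.800]
n_items = [100, 75, 50, 25]
corr = [[1.00, 0.80, 0.75, 0.65],
        [0.80, 1.00, 0.70, 0.60],
        [0.75, 0.70, 1.00, 0.55],
        [0.65, 0.60, 0.55, 1.00]]

UX1, s2_bar, c_bar, n_harmonic = woodruff_feldt_dependent(alphas, n_items, 100, corr)
p_value = chi2_upper_tail_df3(UX1)
print(UX1, s2_bar, c_bar, n_harmonic, p_value)
```

Small differences from the tabled figures in the last digit arise because
the text works from transformed coefficients rounded to four decimals.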

Analogous to the situation involving independent coefficients,
follow-up tests of pairwise contrasts can be made via the t-test
presented earlier for two coefficients.

Cruciality of the Statistical Assumptions

The most fundamental distributional assumption required by these
inferential procedures is that the quantity $(1-\xi)/(1-\hat{\xi})$ be
distributed as F. As previously noted, this assumption will be met if the
scores conform to the dictates of the two-way random effects model (type II
ANOVA) with one observation per cell or the two-way mixed model (type
III ANOVA) with one observation per cell and no interaction effect.
These requirements will be met if the item or part scores are normally
distributed with homogeneous error variances. However, these assumptions
will almost surely be violated if each part of the instrument gives rise
to a restricted range of scores. Therefore, the question arises how
well the procedures may be expected to perform with actual data.

Feldt (1965), in deriving the F distribution for the transformed
alpha coefficient, gives a detailed discussion of the assumptions required
under the random effects model and how they might be violated with
dichotomously scored items. He also reports on the results of a simulation
study based on real test data with dichotomously scored items. The results
indicate that the F distribution holds up well with such data.

In an experimental design context, Seeger and Gabrielsson (1968)
simulate the distribution of mean square ratios under the mixed model
ANOVA when applied to dichotomous data. They consider the situation
involving several observations per cell and focus attention on the F ratio
pertaining to treatment effects. Though this ratio is not the one used
in reliability studies, their simulations offer further indirect evidence
that $(1-\xi)/(1-\hat{\xi})$ is distributed approximately as F even if the
items are dichotomously scored.

Inference for several alpha coefficients based on independent samples
requires, in addition to distributional assumptions, that the sample sizes
be large enough to justify the asymptotic chi square distribution for
the Hakstian/Whalen test statistic $M$ and the Woodruff/Feldt statistic
$UX_2$. Hakstian and Whalen used Monte Carlo methods to investigate the
sampling distribution of test statistic $M$ when computed from dichotomous
part scores. Their results indicate good control of type I error
rates with as few as twenty subjects per test, even for this gross
departure from normality and homogeneity of variance.

If the same sample or matched samples are used for testing the
equality of several alpha coefficients, two additional assumptions are
required. The first is that the $1/(1-\hat{\xi}_i)^{1/3}$ have a joint
multivariate normal distribution. The second is that the correlations
between total scores $Y_i$ and $Y_j$ be identical (homogeneous) for all
pairs of tests. If the $(1-\xi_i)/(1-\hat{\xi}_i)$ have approximate F
distributions, then the $1/(1-\hat{\xi}_i)^{1/3}$ have marginal
distributions approximately normal in form. Given these marginal normal
distributions, it is reasonable to assume multivariate normality. However,
multivariate normality does not automatically follow from the condition of
marginal normality.

Woodruff and Feldt (in press) investigated the power and Type I error
control of $UX_1$ using Monte Carlo methods. They found that for a sample
size as small as 50 and with moderately heterogeneous, positive inter-test
correlations (range of $\rho$ equal to .30), control of Type I error rates
was quite good. However, these simulations were based on continuous,
normally distributed scores. They did not provide evidence as to the
cruciality of the normality assumption for total scores, nor did they
document the effects of dichotomous item scoring.
The results of subsequent Monte Carlo investigations of these issues
are summarized in the tables which follow. In the first of these studies,
dichotomous item scores were generated via a computer simulation technique
described by Nitko (1968). Two true null hypotheses were considered. In
the first, $\xi = .80$ for each of four tests with 30 items in each test.
In the second, $\xi = .65$ for each of three tests with 30, 30, and 60
items, respectively. Each 30-item test exhibited a range of item
difficulties from .30 to .80; the 60-item test had a range of item
difficulties of .35 to .73. The item difficulty distribution for each test
was unimodal and symmetrical around the value .55. The resultant
distributions of total test scores were slightly skewed negatively
($\gamma_1 = -.13$) and platykurtic ($\gamma_2 = -.53$), generally similar
to the distributions for many standardized tests. The inter-test
correlations were homogeneous and equal to their shared reliability (.65 or
.80).

For each null hypothesis, 2200 simulations of random sample data were
produced, based on N = 50 and N = 100. Test $UX_1$ was performed on each
replication, and the percent of test statistics exceeding the upper 10%,
5%, and 1% points of $\chi^2_{(m-1)}$ was tabulated. These data are
summarized in Table 1.

It may be observed that the $UX_1$ test showed no gross effects from
dichotomous item scoring. There is a tendency toward liberality if N = 50
and a 10 or 5 percent level is employed, but this deviation from the
nominal significance level would not disturb most researchers.

Table 1

Estimated Probability of Type I Error Based
on 2200 Replications of the $UX_1$ Test:
Simulated Dichotomous Item Scores

           $\xi = .80$, $m = 4$, $n_i = 30$      $\xi = .65$, $m = 3$, $n_i = 60, 30, 30$
           10%      5%       1%                  10%      5%       1%
N = 50     10.7     5.4      1.4                 10.0     5.1      0.9
N = 100    10.9     5.6      1.1                  9.8     5.0      1.1

The second empirical study used actual test data--scores of Iowa
students in grades 9 and 11 on various subtests of the Iowa Tests of
Educational Development, Form X-7. The subtests for grade 9 were
selectively shortened by the deletion of items so that all tests had
$\xi = .75$. The subtests for grade 11 were differentially shortened so that
all tests had $\xi = .87$. From the pool of 16,443 records for grade 9 and
16,760 records for grade 11, 2,000 random samples of N = 50 and 2,000
samples of N = 100 were chosen by sampling examinees randomly with
replacement. The $UX_1$ test was then executed on each examinee sample,
using m = 2, 3, 4, or 5 ITED subtests. For m = 3 and m = 4, two groups of
subtests were investigated. The first group exhibited less heterogeneity
of inter-test correlations than did the second, but in both cases the null
hypothesis with respect to $\xi$ was true. The results of this study are
summarized in Tables 2 and 3.

With actual test data the control of Type I error was not as tight
as with simulated dichotomous item scores. The deviations from the nominal
significance level were most pronounced with N = 50, though not
consistently in the direction of liberality. With m = 3, for example, the
deviations at the 10% and 5% levels are positive for one group of tests and
negative for the other. It must be borne in mind, of course, that the
standard error of a percent in the vicinity of 10% equals about 0.67% with
2000 trials; near 5% the standard error equals about 0.49%.
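
These standard errors follow from the binomial formula for a proportion,
as the short Python sketch below shows. The function name is hypothetical.

```python
import math

def percent_standard_error(nominal_percent, trials):
    """Binomial standard error, in percent, of an observed rejection rate."""
    p = nominal_percent / 100
    return 100 * math.sqrt(p * (1 - p) / trials)

se_10 = percent_standard_error(10, 2000)
se_5 = percent_standard_error(5, 2000)
print(round(se_10, 2), round(se_5, 2))
```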

A crude summary over the twelve situations for N = 50 and N = 100 gives
rise to the following averages:

            10%       5%        1%
N = 50      10.6%     5.5%      1.3%
N = 100      9.9%     5.0%      1.1%
Table 2

Estimated Probability of Type I Error Based
on 2000 Replications of the $UX_1$ Test:
Actual Dichotomous Item Scores ($\xi = .75$)

Sample     m = 2, $n_i$ = 11,13 and $n_i$ = 11,16        m = 5, $n_i$ = 11,13,16,21,21
Size       10%      5%       1%                          10%      5%       1%
N = 50     10.4     5.6      1.2                         11.9     6.0      1.1
N = 100     9.8     4.7      0.9                         10.3     3.2      0.8

           m = 4, $n_i$ = 11,16,21,21, $.60 < \rho_{xy} < .67$    m = 4, $n_i$ = 11,13,16,21, $.55 < \rho_{xy} < .65$
           10%      5%       1%                          10%      5%       1%
N = 50     11.4     6.4      1.5                          9.5     5.3      0.8
N = 100    10.5     5.1      0.9                          9.0     4.5      0.8

           m = 3, $n_i$ = 11,16,21, $.60 < \rho_{xy} < .65$       m = 3, $n_i$ = 11,13,16, $.55 < \rho_{xy} < .65$
           10%      5%       1%                          10%      5%       1%
N = 50      9.9     4.9      1.3                         10.4     5.6      0.9
N = 100     9.1     4.0      0.6                         10.4     4.1      1.1
Table 3

Estimated Probability of Type I Error Based
on 2000 Replications of the $UX_1$ Test:
Actual Dichotomous Item Scores ($\xi = .87$)

Sample     m = 2, $n_i$ = 24,34 and $n_i$ = 28,46        m = 5, $n_i$ = 24,28,31,36,46
Size       10%      5%       1%                          10%      5%       1%
N = 50     10.4     5.5      1.4                         10.2     5.2      1.4
N = 100     9.7     5.0      1.2                          9.7     5.5      1.7

           m = 4, $n_i$ = 24,36,46,28, $.70 < \rho_{xy} < .79$    m = 4, $n_i$ = 31,24,46,28, $.63 < \rho_{xy} < .79$
           10%      5%       1%                          10%      5%       1%
N = 50     12.4     6.9      1.7                         10.3     4.7      1.2
N = 100    10.8     5.8      1.5                          9.8     5.5      1.4

           m = 3, $n_i$ = 11,16,34, $.60 < \rho_{xy} < .65$       m = 3, $n_i$ = 31,24,46, $.63 < \rho_{xy} < .76$
           10%      5%       1%                          10%      5%       1%
N = 50     11.8     6.0      1.6                          8.6     4.4      1.2
N = 100    11.0     3.9      1.1                          8.8     4.3      0.9
These means are very close to the averages for simulated data (Table 1).
Together, they support the conclusion that the $UX_1$ test works quite well
with N = 100, but it errs on the side of liberality with N = 50. The degree
of liberality is not great, and most researchers would probably be willing
to accept a test that controls Type I error within one-half of one percent.
But there is a need for an improved test for use with sample sizes of 50
or less. It is pertinent to note that almost all of the test instruments
used in this study gave rise to negatively skewed, platykurtic score
distributions. The index of skewness ranged between -.597 and +.165,
with eight of the ten indices negative. The index of kurtosis ranged
between +.015 and -.948. The average value of $\gamma_2$ for all ten tests
(five in each of two grades) was -.676. Quite possibly this characteristic
of the score distributions accounts for the liberality of the $UX_1$ test
with the smaller sample size.